HTML Tags Processed by Packager
WebCD Packager parses the following HTML tags out of a downloaded HTML file:
Base Tag
Base tag information is used to resolve relative URLs in the page:
<BASE HREF = "url">
This tag is removed unless it has attributes other than HREF. In that case, only the HREF attribute is removed and the rest of the tag is kept.
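The effect of a BASE tag on relative URLs can be sketched as follows. This is only an illustration using Python's standard library, not WebCD Packager's own code; the function name resolve and the example.com URLs are hypothetical.

```python
from urllib.parse import urljoin

def resolve(page_url, href, base_href=None):
    """Resolve href against the BASE HREF if one exists, else against the page URL."""
    # A BASE HREF may itself be relative, so resolve it against the page URL first.
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, href)

# Without a BASE tag, the link resolves relative to the page.
print(resolve("http://example.com/docs/index.html", "img/logo.gif"))
# With <BASE HREF = "http://example.com/assets/">, it resolves relative to the base.
print(resolve("http://example.com/docs/index.html", "img/logo.gif",
              "http://example.com/assets/"))
```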
Form Actions
Form actions are always added as links and default to Do Not Retrieve:
<FORM ACTION = "url">
Note The form itself is retrieved.
URLs Beneath the Path of the Starting URL
The following URLs are marked for retrieval, and retrieved, if they fall within
the path prefix of a URL that is marked for retrieval (that is, the parent is
marked for retrieval); otherwise, they default to Do Not Retrieve:
<A HREF = "url">
<AREA HREF = "url">
<EMBED SRC = "url1" PLUGINSPAGE = "url2">
Note url1 defaults to Retrieve. url2 defaults to Retrieve if its parent is Retrieve.
<FRAME SRC = "url">
<IFRAME SRC = "url">
<LINK HREF = "url">
<META HTTP-EQUIV = "Refresh" CONTENT = "n; URL=url">
<OBJECT CODE = "url1" CODEBASE = "url2">
Note url1 and url2 are combined to form one URL. Neither attribute is required. If only one is present, that URL is used.
<SCRIPT SRC = "url">
Exception Server side image maps (the map-url in the tag below) default to Do Not Retrieve. The img-url is retrieved.
<A HREF="map-url"><IMG SRC="img-url" ISMAP></A>
WebCD Packager has a server side image map conversion option. The absence of a
USEMAP attribute (as in the tag directly above) indicates a server side image map.
When WebCD Packager detects this situation, it automatically adds a URL for the
image map file to the retrieval specification (Retrieval View). The added URL is a
file URL that points to the project's Maps folder and uses the map-url's base name.
See Converting Server Side Image Maps to Client Side for more details on server side image map conversion.
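The path-prefix rule that governs the tags in this section can be sketched as follows. This is a minimal illustration, not WebCD Packager's implementation. It assumes that "path prefix" means the directory portion of the parent URL on the same scheme and host; the function names and example URLs are hypothetical.

```python
from urllib.parse import urlsplit

def path_prefix(url):
    """Return (scheme, host, directory) for a URL; the directory ends with '/'."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Assumption: everything up to the last '/' is the directory prefix.
    directory = path[: path.rfind("/") + 1]
    return parts.scheme, parts.netloc.lower(), directory

def default_action(parent_url, link_url, parent_retrieved=True):
    """Default setting for a link found in parent_url."""
    if not parent_retrieved:
        return "Do Not Retrieve"
    scheme, host, directory = path_prefix(parent_url)
    link = urlsplit(link_url)
    same_site = (link.scheme, link.netloc.lower()) == (scheme, host)
    if same_site and (link.path or "/").startswith(directory):
        return "Retrieve"
    return "Do Not Retrieve"

parent = "http://example.com/docs/index.html"
print(default_action(parent, "http://example.com/docs/ch1/a.html"))
print(default_action(parent, "http://example.com/other/b.html"))
```

Under these assumptions the first link falls beneath /docs/ and defaults to Retrieve, while the second lies outside it and defaults to Do Not Retrieve.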
URLs that are Always Retrieved
The following URLs are always retrieved, whether or not they are within the
path of the starting URL:
<BGSOUND SRC = "url">
<BODY BACKGROUND = "url">
<IMG SRC = "url" DYNSRC = "url" LOWSRC = "url">
<INPUT SRC = "url">
<TABLE BACKGROUND = "url">
<TD BACKGROUND = "url">
<TH BACKGROUND = "url">
<TR BACKGROUND = "url">
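As an illustration, the tag and attribute pairs listed above can be collected from a page with Python's standard html.parser module. This is a sketch only; the class and function names are hypothetical and do not describe how WebCD Packager itself parses HTML.

```python
from html.parser import HTMLParser

# Tag -> attributes that the list above describes as always retrieved.
ALWAYS_RETRIEVED = {
    "bgsound": ("src",),
    "body": ("background",),
    "img": ("src", "dynsrc", "lowsrc"),
    "input": ("src",),
    "table": ("background",),
    "td": ("background",),
    "th": ("background",),
    "tr": ("background",),
}

class AlwaysRetrievedCollector(HTMLParser):
    """Collects URL attribute values from the tags in ALWAYS_RETRIEVED."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        wanted = ALWAYS_RETRIEVED.get(tag, ())
        for name, value in attrs:
            if name in wanted and value:
                self.urls.append(value)

def collect_always_retrieved(html):
    collector = AlwaysRetrievedCollector()
    collector.feed(html)
    collector.close()
    return collector.urls

page = '<BODY BACKGROUND="bg.gif"><IMG SRC="a.gif" LOWSRC="a-low.gif"><A HREF="x.html">x</A></BODY>'
print(collect_always_retrieved(page))
```

The A HREF link is not collected here, because it belongs to the path-dependent group rather than the always-retrieved group.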
Title Text
Title text is stored as the page's title. Leading white space, trailing white
space, and non-printable characters are ignored:
<TITLE> and </TITLE>
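The title clean-up described above might look like the following sketch. It assumes that "non-printable" corresponds to Python's str.isprintable test, which treats control characters such as tabs and newlines as non-printable; the function name clean_title is hypothetical.

```python
def clean_title(raw):
    """Drop non-printable characters, then trim leading and trailing white space."""
    # Assumption: control characters (tabs, newlines, and so on) are simply dropped.
    printable = "".join(ch for ch in raw if ch.isprintable())
    return printable.strip()

print(repr(clean_title(" \n My Page\t ")))
```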
Header Information
WebCD Packager uses HTTP header information to determine whether a URL is
redirected. URLs that are redirected outside the path of the starting URL default
to Do Not Retrieve.
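A redirect check of this kind could be sketched as follows. The sketch assumes that a redirect is signalled by a 3xx status code with a Location header, and that "the path of the starting URL" means its directory on the same scheme and host. The function names are hypothetical and the headers are passed in as a plain dictionary.

```python
from urllib.parse import urljoin, urlsplit

def redirect_target(request_url, status, headers):
    """Return the absolute redirect target, or None if this is not a redirect."""
    # Header names are case-insensitive, so normalize them before the lookup.
    location = {k.lower(): v for k, v in headers.items()}.get("location")
    if 300 <= status < 400 and location:
        return urljoin(request_url, location)
    return None

def redirects_outside(starting_url, request_url, status, headers):
    """True if the response redirects outside the starting URL's path."""
    target = redirect_target(request_url, status, headers)
    if target is None:
        return False
    start = urlsplit(starting_url)
    dest = urlsplit(target)
    # Assumption: the starting path is the directory part of the starting URL.
    start_dir = start.path[: start.path.rfind("/") + 1] or "/"
    same_site = (dest.scheme, dest.netloc.lower()) == (start.scheme, start.netloc.lower())
    return not (same_site and (dest.path or "/").startswith(start_dir))

start = "http://example.com/docs/index.html"
print(redirects_outside(start, "http://example.com/docs/old.html", 302,
                        {"Location": "/archive/old.html"}))
```

Under these assumptions, a URL for which redirects_outside returns True would be the kind that defaults to Do Not Retrieve.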
Note More tags will be processed in future releases of WebCD Packager.
Related Topics
Converting Server Side Image Maps to Client Side
HTTP Redirection
Omitting/Inserting HTML in the Packager Process